Introduction

Question of interest

Alzheimer’s disease, the most common cause of dementia, involves both genetic and environmental factors, with familial Alzheimer’s disease (fAD) often linked to mutations in genes like PSEN1 [1]. Studying these mutations is crucial as they not only affect gene expression in the brain but also offer insights into the broader mechanisms of Alzheimer’s, enhancing our understanding of both familial and sporadic forms of the disease [1]. PSEN1 mutations are critical because they directly impact the processing of amyloid precursor protein (APP) by the γ-secretase complex, leading to the production of amyloid beta (Aβ) peptides [2]. When these mutations occur, they can enhance or disrupt the normal cleavage of APP, resulting in an increased accumulation of toxic amyloid plaques in the brain, a hallmark of Alzheimer’s disease [2]. This aberrant processing is linked to early-onset forms of the disease, underscoring the importance of PSEN1 in Alzheimer’s pathogenesis [2]. The L150P mutation in the PSEN1 gene is a specific point mutation, where the normal leucine (L) at position 150 is replaced by proline (P), leading to alterations in the protein’s function and contributing to the disease’s early onset and progression [3]. My question of interest is, how do specific familial Alzheimer’s disease (fAD) mutations in the PSEN1 gene affect the gene expression in glutamatergic forebrain neurons, and how might these changes contribute to the pathogenesis of Alzheimer’s disease?

Hypothesis

Familial Alzheimer’s disease (fAD) mutations in the PSEN1 gene lead to significant alterations in the gene expression profiles of glutamatergic forebrain neurons, with a particular impact on the expression of genes related to the extracellular matrix. These genetic alterations manifest not only in the upregulation and downregulation of key neuronal genes but also prominently influence the composition and function of the extracellular matrix. Changes in this matrix are hypothesized to disrupt cellular communication and integrity, subsequently influencing several pathways pivotal to the pathogenesis of Alzheimer’s disease. This focus on the extracellular matrix could provide insights into the structural and functional disruptions that contribute to the disease’s progression, offering potential targets for therapeutic intervention.

This hypothesis will be tested through comprehensive RNA-seq analysis, comparing gene expression profiles from glutamatergic forebrain neurons derived from individuals with fAD-related PSEN1 mutation (L150P) to those from matched isogenic controls. Subsequent bioinformatics analyses will focus on identifying significantly affected genes and pathways.

Results

Key insights

Comprehensive RNA-seq analysis identified a total of 4,384 significantly altered genes (|log2FoldChange| > 1) in glutamatergic forebrain neurons derived from individuals with a familial Alzheimer’s disease (fAD)-related PSEN1 mutation (L150P), compared to matched isogenic controls. Of these, 2,942 were protein-coding genes, with notable genes including PAX3, TTR, TFAP2B, NEUROD6, SLC13A4, SLC22A6, LEFTY2, PRPH, FZD10, BARHL2, MFRP, ITGA11, VIT, SERPINB12, WNT1, and NEUROD4 showing top differential expression.

Gene Ontology (GO) enrichment analysis highlighted significant dysregulation in biological processes and cellular components related to the extracellular matrix (ECM), confirming our hypothesis. Key GO terms such as extracellular matrix organization (GO:0030198, p.adjust = 8.46e-36), collagen-containing extracellular matrix (GO:0062023, p.adjust = 5.74e-66), and extracellular structure organization (GO:0043062, p.adjust = 7.75e-36) were among the most significantly enriched. These categories encompassed genes known for their roles in ECM structure and function.

Among the highlighted genes, ITGA11 stands out for its crucial role in cell-matrix adhesion and is involved in the structural organization of the ECM. The WNT1 and FZD10 genes are part of the Wnt signaling pathway, crucial for the maintenance and repair of this matrix, influencing how cells interact with their surroundings. TTR, while primarily involved in transporting thyroid hormones and vitamin A, also plays a role in removing amyloid-beta proteins that can disrupt the matrix structure in Alzheimer’s. Lastly, SERPINB12 helps regulate the breakdown of other proteins, indirectly maintaining the matrix’s integrity.

Furthermore, terms related to the external encapsulating structure and connective tissue development showed significant enrichment, implicating these structural components in the disease pathology of Alzheimer’s related to the PSEN1 mutation. The results suggest that the PSEN1 (L150P) mutation exerts profound effects on the expression of genes associated with the extracelullar membrane, potentially disrupting cellular communication and integrity, which are critical in Alzheimer’s disease pathogenesis.

Future experiments and analyses

To address the limitations posed by the homogeneity of the current sample set, future experiments should include a more diverse cohort. This would involve collecting samples from multiple individuals with varying genetic backgrounds, ages, and both genders. Such diversity will enhance the generalizability of the findings and help elucidate the broader implications of PSEN1 mutations across different populations.

To capture cellular heterogeneity within glutamatergic forebrain neurons, single-cell RNA sequencing (scRNA-seq) could be employed. This method would allow for the dissection of the cellular composition of the brain region and the identification of specific cell types that are most susceptible to PSEN1 mutations.

Integrating RNA-seq data with proteomic and metabolomic analyses could provide a more comprehensive view of the molecular alterations due to fAD mutations. This multi-omics approach would help correlate transcriptomic changes with alterations in protein expression and metabolic pathways, offering a holistic view of the disease mechanisms.

Employing computational models to simulate the impact of genetic mutations on neuronal networks could provide insights into how alterations in gene expression affect brain function. Such models could help predict the clinical outcomes of genetic mutations, aiding in the development of targeted therapies.

Limitations

Both the control and non-control samples are derived from a single 58-year-old male. This severely limits the genetic and phenotypic diversity represented in the data. The results obtained may not be generalizable across different ages, sexes, or genetic backgrounds, which are critical factors in the study of neurodegenerative diseases like Alzheimer’s.

The small sample size (5 fAD samples and 5 controls) restricts the statistical power of the study and may not capture the variability present in the general population. This limitation could affect the reliability of the conclusions drawn from the data.

Given that samples are processed in batches, there is a risk of introducing non-biological variations that can influence the results. Without appropriate controls from diverse backgrounds, it’s challenging to distinguish these effects from genuine biological differences.

Focusing solely on specific mutations within the PSEN1 gene limits the scope of the study. The findings might not extend to other variants within the gene or to other genes implicated in Alzheimer’s disease.

Methods

A detailed verbose description of all the steps you took to arrive at the conclusion including how and where the data was downloaded, pre-processed and analyzed. This should also include some brief reasoning of why you chose certain tools/solutions and what the results of the QC tell you about the data at hand. ## Details regarding the data -My data is deposited in GEO as GSE211993

-I have downloaded my raw sequences from SRA (40 FASTQ pair end can be found using ID PRJNA873113)

-My publication is The transcriptomic landscape of neurons carrying PSEN1 mutations reveals changes in extracellular matrix components and non-coding gene expression

-The data were generated by researchers involved in the studies cited, including Tubsuwan et al. (2016), Li et al. (2016), Pires et al. (2016), Poon et al. (2016), Shi et al. (2012), and Haukedal et al. (2022), among others involved in the generation, differentiation, and characterization of hiPSCs and the subsequent RNA sequencing and analysis.

-RNA data was extracted using the RNeasy® Plus Mini Kit (Qiagen, 74,134) according to the manufacturer’s protocol.

-Bulk Poly(A) and total RNA libraries were constructed for 5 replicates according to BGI protocols and sequenced with BGIseq (DNBseq, 2 × 100 nt, stranded paired-end).

-Cells of interest were human induced pluripotent stem cells (hiPSCs) were differentiated into glutamatergic forebrain neurons.

-The experimental condition involved the differentiation of hiPSCs into glutamatergic forebrain neurons using a modified version of a protocol exploiting dual SMAD inhibition with LDN-193189 as an analogue for Noggin, targeting mutations PSEN1 L150P and PSEN1 A79V through CRISPR-Cas9 gene editing.

-The sequencing was conducted with BGIseq (DNBseq, 2 × 100 nt, stranded paired-end).

Prerequisite - Conda environment

Create rna_seq_analysis.yaml to create conda environment with all necessary tools:

name: rna_seq_analysis
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - sra-tools=3.1.0
  - fastqc=0.12.1
  - multiqc=1.21
  - cutadapt=4.7
  - bbmap=39.06
  - star=2.7.11b
  - samtools=1.19.2
  - subread=2.0.6
  - wget=1.21.4
  - rseqc=5.0.3

Run the following command to create new environment with specified name:

conda env create -f rna_seq_analysis.yaml

This environment contains all you need for the further analysis, just activate it via:

conda activate rna_seq_analysis

Prerequisite - Interactive shell/slurm

You can either run following scripts in interactive shell on Weill Cornell Cayuga server using:

srun -n1 --pty --partition=angsd_class --mem=40G bash -i

or you can edit scripts to run them using batch scheduler of your choice.

Prerequisite - Script location

All scripts are located at /athena/angsd/scratch/mij4011/homework/project

Prerequisite - Master Script

Master Script can be used to run all the tasks in a series to avoid unnecessary repetitiveness and improve reproducibility.

master.sh:

#!/bin/bash

# Stop on error
set -e

# Activate environment
conda activate rna_seq_analysis

# Step 1: Download Data
bash download_data.sh

# Step 2: Quality Control
bash fastqc.sh

# Step 3: Adapter Trimming
bash cutadapt.sh

# Step 4: rRNA Removal
bash silva_database_download.sh
bash run_bbduk.sh

# Quality control after processing (optional)
bash processed_multiqc.sh

# Step 5: Alignment
bash hg38.sh
bash index.sh
bash star.sh

# Quality check for mapping (optional)
bash mapping_quality.sh

# Assess rRNA removal via RSeQC
bash RSeQC.sh

# Step 6: Feature Counts
bash featurecounts.sh

echo "Analysis Complete"

1. Writing a script to download fastq files.

download_data.sh:

# ! / bin / bash -i

# Define an array of accession numbers from downloaded metadata file from SRA
ACCESSIONS=(
SRR21190736 SRR21190737 SRR21190738 SRR21190739 SRR21190740
SRR21190741 SRR21190742 SRR21190743 SRR21190744 SRR21190745
SRR21190746 SRR21190747 SRR21190748 SRR21190749 SRR21190750
SRR21190751 SRR21190752 SRR21190753 SRR21190754 SRR21190755
SRR21190756 SRR21190757 SRR21190758 SRR21190759 SRR21190760
SRR21190761 SRR21190762 SRR21190763 SRR21190764 SRR21190765
SRR21190766 SRR21190767 SRR21190768 SRR21190769 SRR21190770
SRR21190771 SRR21190772 SRR21190773 SRR21190774 SRR21190775
)

# Create a directory named "fastq" and move into it
mkdir -p fastq
cd fastq

# Loop through the accession numbers and download each one
for ACCESSION in "${ACCESSIONS[@]}"; do
    echo "Downloading $ACCESSION..."
    fasterq-dump "$ACCESSION" --split-files --skip-technical --threads 4
done

echo "Download completed."

# Move back to the original directory
cd ..

2. Quality control with FastQC and MultiQC

fastqc.sh:

#!/bin/bash

FASTQ_DIR="fastq"

# Output directory for FastQC results
OUTPUT_DIR="fastqc_results"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR" multiqc

# Run FastQC on all FASTQ files in the directory
fastqc -o "$OUTPUT_DIR" -t 40 "$FASTQ_DIR"/*.fastq

# Run MultiQC to combine the FastQC reports
multiqc "$OUTPUT_DIR" -o multiqc

3. Trimming data

The sequencing was performed using BGIseq (DNBseq) technology and libraries were constructed following BGI protocols, the adapters you need to trim will likely be specific to BGI’s sequencing platforms. I have found out that adapter for forward strand is AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA and for backward is AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG. I am going to test it using the following script. adapter_test.sh:

#!/bin/bash

# Adapter sequences
FORWARD_ADAPTER="AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"
REVERSE_ADAPTER="AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG"
FASTQ_DIR="path/to/fastq_files"

# Loop through all FASTQ files in the specified directory
for FASTQ_FILE in $FASTQ_DIR/*.fastq; do
    # Determine if the file is forward or reverse based on its name
    if [[ $FASTQ_FILE =~ _1.fastq$ ]]; then
        # Search for the forward adapter
        ADAPTER_SEQ=$FORWARD_ADAPTER
        COUNT=$(grep -o "$ADAPTER_SEQ" "$FASTQ_FILE" | wc -l)
        echo "Found $COUNT instances of the forward adapter sequence in $FASTQ_FILE"
    elif [[ $FASTQ_FILE =~ _2.fastq$ ]]; then
        # Search for the reverse adapter
        ADAPTER_SEQ=$REVERSE_ADAPTER
        COUNT=$(grep -o "$ADAPTER_SEQ" "$FASTQ_FILE" | wc -l)
        echo "Found $COUNT instances of the reverse adapter sequence in $FASTQ_FILE"
    else
        echo "Skipping $FASTQ_FILE: not recognized as forward or reverse."
    fi
done

output of testing:


Found 6058 instances of the forward adapter sequence in fastq/SRR21190736_1.fastq
Found 928 instances of the reverse adapter sequence in fastq/SRR21190736_2.fastq
Found 5027 instances of the forward adapter sequence in fastq/SRR21190737_1.fastq
Found 752 instances of the reverse adapter sequence in fastq/SRR21190737_2.fastq
Found 5326 instances of the forward adapter sequence in fastq/SRR21190738_1.fastq
Found 881 instances of the reverse adapter sequence in fastq/SRR21190738_2.fastq
Found 3339 instances of the forward adapter sequence in fastq/SRR21190739_1.fastq
Found 600 instances of the reverse adapter sequence in fastq/SRR21190739_2.fastq
Found 5090 instances of the forward adapter sequence in fastq/SRR21190740_1.fastq
Found 876 instances of the reverse adapter sequence in fastq/SRR21190740_2.fastq
Found 6257 instances of the forward adapter sequence in fastq/SRR21190741_1.fastq
Found 1025 instances of the reverse adapter sequence in fastq/SRR21190741_2.fastq
Found 3346 instances of the forward adapter sequence in fastq/SRR21190742_1.fastq
Found 553 instances of the reverse adapter sequence in fastq/SRR21190742_2.fastq
Found 4611 instances of the forward adapter sequence in fastq/SRR21190743_1.fastq
Found 663 instances of the reverse adapter sequence in fastq/SRR21190743_2.fastq
Found 4848 instances of the forward adapter sequence in fastq/SRR21190744_1.fastq
Found 858 instances of the reverse adapter sequence in fastq/SRR21190744_2.fastq
Found 4369 instances of the forward adapter sequence in fastq/SRR21190745_1.fastq
Found 1021 instances of the reverse adapter sequence in fastq/SRR21190745_2.fastq
Found 7493 instances of the forward adapter sequence in fastq/SRR21190746_1.fastq
Found 1434 instances of the reverse adapter sequence in fastq/SRR21190746_2.fastq
Found 5330 instances of the forward adapter sequence in fastq/SRR21190747_1.fastq
Found 982 instances of the reverse adapter sequence in fastq/SRR21190747_2.fastq
Found 6193 instances of the forward adapter sequence in fastq/SRR21190748_1.fastq
Found 1116 instances of the reverse adapter sequence in fastq/SRR21190748_2.fastq
Found 5028 instances of the forward adapter sequence in fastq/SRR21190749_1.fastq
Found 905 instances of the reverse adapter sequence in fastq/SRR21190749_2.fastq
Found 5482 instances of the forward adapter sequence in fastq/SRR21190750_1.fastq
Found 869 instances of the reverse adapter sequence in fastq/SRR21190750_2.fastq
Found 4665 instances of the forward adapter sequence in fastq/SRR21190751_1.fastq
Found 689 instances of the reverse adapter sequence in fastq/SRR21190751_2.fastq
Found 4057 instances of the forward adapter sequence in fastq/SRR21190752_1.fastq
Found 770 instances of the reverse adapter sequence in fastq/SRR21190752_2.fastq
Found 5233 instances of the forward adapter sequence in fastq/SRR21190753_1.fastq
Found 909 instances of the reverse adapter sequence in fastq/SRR21190753_2.fastq
Found 3782 instances of the forward adapter sequence in fastq/SRR21190754_1.fastq
Found 749 instances of the reverse adapter sequence in fastq/SRR21190754_2.fastq
Found 3413 instances of the forward adapter sequence in fastq/SRR21190755_1.fastq
Found 633 instances of the reverse adapter sequence in fastq/SRR21190755_2.fastq
Found 2812 instances of the forward adapter sequence in fastq/SRR21190756_1.fastq
Found 358 instances of the reverse adapter sequence in fastq/SRR21190756_2.fastq
Found 1054 instances of the forward adapter sequence in fastq/SRR21190757_1.fastq
Found 113 instances of the reverse adapter sequence in fastq/SRR21190757_2.fastq
Found 2271 instances of the forward adapter sequence in fastq/SRR21190758_1.fastq
Found 291 instances of the reverse adapter sequence in fastq/SRR21190758_2.fastq
Found 2866 instances of the forward adapter sequence in fastq/SRR21190759_1.fastq
Found 365 instances of the reverse adapter sequence in fastq/SRR21190759_2.fastq
Found 2019 instances of the forward adapter sequence in fastq/SRR21190760_1.fastq
Found 227 instances of the reverse adapter sequence in fastq/SRR21190760_2.fastq
Found 1951 instances of the forward adapter sequence in fastq/SRR21190761_1.fastq
Found 187 instances of the reverse adapter sequence in fastq/SRR21190761_2.fastq
Found 1723 instances of the forward adapter sequence in fastq/SRR21190762_1.fastq
Found 207 instances of the reverse adapter sequence in fastq/SRR21190762_2.fastq
Found 2604 instances of the forward adapter sequence in fastq/SRR21190763_1.fastq
Found 333 instances of the reverse adapter sequence in fastq/SRR21190763_2.fastq
Found 4818 instances of the forward adapter sequence in fastq/SRR21190764_1.fastq
Found 794 instances of the reverse adapter sequence in fastq/SRR21190764_2.fastq
Found 6579 instances of the forward adapter sequence in fastq/SRR21190765_1.fastq
Found 1071 instances of the reverse adapter sequence in fastq/SRR21190765_2.fastq
Found 6366 instances of the forward adapter sequence in fastq/SRR21190766_1.fastq
Found 1169 instances of the reverse adapter sequence in fastq/SRR21190766_2.fastq
Found 5101 instances of the forward adapter sequence in fastq/SRR21190767_1.fastq
Found 920 instances of the reverse adapter sequence in fastq/SRR21190767_2.fastq
Found 8104 instances of the forward adapter sequence in fastq/SRR21190768_1.fastq
Found 1580 instances of the reverse adapter sequence in fastq/SRR21190768_2.fastq
Found 3275 instances of the forward adapter sequence in fastq/SRR21190769_1.fastq
Found 413 instances of the reverse adapter sequence in fastq/SRR21190769_2.fastq
Found 8945 instances of the forward adapter sequence in fastq/SRR21190770_1.fastq
Found 1371 instances of the reverse adapter sequence in fastq/SRR21190770_2.fastq
Found 4592 instances of the forward adapter sequence in fastq/SRR21190771_1.fastq
Found 697 instances of the reverse adapter sequence in fastq/SRR21190771_2.fastq
Found 4857 instances of the forward adapter sequence in fastq/SRR21190772_1.fastq
Found 882 instances of the reverse adapter sequence in fastq/SRR21190772_2.fastq
Found 4420 instances of the forward adapter sequence in fastq/SRR21190773_1.fastq
Found 715 instances of the reverse adapter sequence in fastq/SRR21190773_2.fastq
Found 4435 instances of the forward adapter sequence in fastq/SRR21190774_1.fastq
Found 637 instances of the reverse adapter sequence in fastq/SRR21190774_2.fastq
Found 3292 instances of the forward adapter sequence in fastq/SRR21190775_1.fastq
Found 565 instances of the reverse adapter sequence in fastq/SRR21190775_2.fastq

Looks like the adapter sequences I found are fine, so we can jump to trimming.

cutadapt.sh:

#!/bin/bash

# Adapter sequences
FORWARD_ADAPTER="AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"
REVERSE_ADAPTER="AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG"

# Create a directory for the trimmed output if it doesn't exist
mkdir -p trimmed_fastq

# Loop through all the _1.fastq files in the fastq directory
for file in fastq/*_1.fastq; do
    # Identify the pair
    file1=$file
    file2=${file/_1.fastq/_2.fastq}

    # Generate output filenames
    out1=trimmed_fastq/$(basename "$file1" .fastq)_trimmed.fastq
    out2=trimmed_fastq/$(basename "$file2" .fastq)_trimmed.fastq

    # Run Cutadapt
    cutadapt \
        -a "$FORWARD_ADAPTER" -A "$REVERSE_ADAPTER" \
        -q 30 --pair-filter=any --minimum-length=30 \
        -o "$out1" -p "$out2" \
        "$file1" "$file2"
done

-q 30: This option trims low-quality ends from reads before adapter removal. The number 30 specifies the quality score threshold. In this case, bases with a quality score lower than 30 will be trimmed from the end of each read.

–pair-filter=any: This is used with paired-end reads. It indicates that if either the forward or reverse read of a pair needs to be discarded (for example, if after trimming it’s too short), then both reads of the pair will be discarded,thus ensuring that the output maintains properly paired reads.

–minimum-length=30: This specifies the minimum length of reads to keep after trimming. Reads that are shorter than 30 bases after trimming will be discarded. This is useful for filtering out very short sequences that may not be useful for further analysis.

4. Removal of residual rRNA annotated in SILVA with BBDuk

Removing residual ribosomal RNA (rRNA) from RNA sequencing samples is crucial because rRNA makes up the majority of cellular RNA, thereby diluting the less abundant but biologically informative messenger RNA (mRNA) and other non-coding RNAs.

First , we will need to download SSU (Small Subunit) and LSU (Large Subunit) Ref NR (Non-Redundant) datasets from silva. This can be done via silva_database_download.sh:

#!/bin/bash

# Downloading the SILVA LSU and SSU reference FASTA files
wget https://ftp.arb-silva.de/release_119.1/Exports/SILVA_119_LSURef_tax_silva.fasta.gz
wget https://ftp.arb-silva.de/release_119.1/Exports/SILVA_119.1_SSURef_Nr99_tax_silva.fasta.gz

# Decompressing the downloaded files
gunzip SILVA_119_LSURef_tax_silva.fasta.gz
gunzip SILVA_119.1_SSURef_Nr99_tax_silva.fasta.gz

And running bbduk.sh to actually remive residual rRNA:

#!/bin/bash

# Directory paths
TRIMMED_FASTQ_DIR="trimmed_fastq"
RRNA_REMOVED_DIR="rRNA_removed_fastq"
TEMP_DIR="temp_rRNA_removed"

# Reference databases (ensure these are in FASTA format)
LSU_DB="SILVA_119_LSURef_tax_silva.fasta"
SSU_DB="SILVA_119.1_SSURef_Nr99_tax_silva.fasta"

# Number of threads to use
THREADS=32

mkdir -p "$RRNA_REMOVED_DIR" "$TEMP_DIR"

# Function to run BBDuk for rRNA removal
run_bbduk() {
    local in1=$1
    local in2=$2
    local out1=$3
    local out2=$4
    local ref=$5

    bbduk.sh \
        in="$in1" \
        in2="$in2" \
        out="$out1" \
        out2="$out2" \
        ref="$ref" \
        k=31 \
        ktrim=n \
        mcf=0.5 \
        tbo \
        tpe \
        threads="$THREADS"
}

# Loop through all the trimmed FASTQ files in pairs
for file1 in "$TRIMMED_FASTQ_DIR"/*_1_trimmed.fastq; do
    file2="${file1%_1_trimmed.fastq}_2_trimmed.fastq"
    
    # Intermediate and final output filenames
    temp1="$TEMP_DIR/$(basename "$file1")"
    temp2="$TEMP_DIR/$(basename "$file2")"
    final1="$RRNA_REMOVED_DIR/$(basename "$file1" .fastq)_rRNA_removed.fastq"
    final2="$RRNA_REMOVED_DIR/$(basename "$file2" .fastq)_rRNA_removed.fastq"
    
    # Step 1: Remove LSU rRNA
    echo "Removing LSU rRNA for: $(basename "$file1") and $(basename "$file2")"
    run_bbduk "$file1" "$file2" "$temp1" "$temp2" "$LSU_DB"
    
    # Step 2: Remove SSU rRNA using the output from step 1 as input
    echo "Removing SSU rRNA for: $(basename "$file1") and $(basename "$file2")"
    run_bbduk "$temp1" "$temp2" "$final1" "$final2" "$SSU_DB"
    
    # Optional: Remove intermediate files
    rm "$temp1" "$temp2"
    
    echo "SSU and LSU rRNA removal done for: $(basename "$file1") and $(basename "$file2")"
done

echo "rRNA removal process completed."

in=“$in1”: Specifies the input file for the first mate in paired-end reads.

in2=“$in2”: Specifies the second mate in paired-end reads.

out=“$out1”: Specifies the output file for processed first mates.

out2=“$out2”: Specifies the output file for processed reads from the second input file in paired-end sequencing.

ref=“$ref”: This parameter specifies the reference sequences to be used for trimming or filtering the reads..

k=31: Defines the k-mer length to be used for matching against the reference sequences. This is important for the specificity and sensitivity of the matching process.

ktrim=n: Indicates that k-mers found in reads should not be trimmed off. Instead, reads will be left intact even if matching k-mers are found.

mcf=0.5: Sets the minimum coverage filter for trimming. It specifies the minimum fraction of bases that must match in a k-mer for it to be considered a valid match.

tbo: Stands for “trim by overlap”. This option enables trimming of adapters based on paired-end read overlap detection, which can improve adapter removal when adapter sequences are not known or specified.

tpe: Enables “trim paired ends”, which means that if adapters were trimmed from either read in a pair, the other read is trimmed by the same amount.

Now we can run multiqc again.

processed_multiqc.sh:

#!/bin/bash

# Directory containing FASTQ files
FASTQ_DIR="rRNA_removed_fastq/"

# Output directory for FastQC results
OUTPUT_DIR="processed_fastqc_results"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR" multiqc

# Run FastQC on all FASTQ files in the directory
fastqc -o "$OUTPUT_DIR" -t 40 "$FASTQ_DIR"/*.fastq

# Run MultiQC to aggregate the FastQC reports
multiqc "$OUTPUT_DIR" -o processed_multiqc

5. Alignment with star

First, we should download hg38 assembly (used in the paper).

hg38.sh:

#!/bin/bash

wget ftp://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Create an index with index.sh:

#!/bin/bash

GENOME_FASTA="Homo_sapiens.GRCh38.dna.primary_assembl$
ANNOTATION_GTF="GSE211993_gencode_fantomcat.v1.02.gen$
STAR_INDEX_DIR="star_index"

mkdir -p "$STAR_INDEX_DIR"

# Run STAR in genomeGenerate mode
STAR --runThreadN 32 \
    --runMode genomeGenerate \
    --genomeDir "$STAR_INDEX_DIR" \
    --genomeFastaFiles "$GENOME_FASTA" \
    --sjdbGTFfile "$ANNOTATION_GTF" \  
    --sjdbOverhang 99 
    
echo "Indexing complete."

–runThreadN 32: Number of threads

–runMode genomeGenerate: This tells STAR to run in genome indexing mode.

–genomeDir “$STAR_INDEX_DIR”: Specifies the directory where the generated genome indices will be stored.

–genomeFastaFiles “$GENOME_FASTA”: Specifies the path to the FASTA files containing the reference genome sequences.

–sjdbGTFfile “$ANNOTATION_GTF”: Specifies the path to the GTF file containing the genome annotation.

–sjdbOverhang 99: This parameter sets the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. The value should be set to the read length minus 1. In this case, 99 suggests that the read length is expected to be 100 bases.

run STAR with star.sh:

#!/bin/bash

# Define the number of threads to use
THREADS=32

# Path to the STAR genome index
GENOME_DIR="star_index"

# Directory containing rRNA removed FASTQ files
FASTQ_DIR="rRNA_removed_fastq"

# Output directory for the alignments
ALIGNMENT_DIR="alignments"
mkdir -p "$ALIGNMENT_DIR"

# Loop through the FASTQ files in pairs
for R1 in "$FASTQ_DIR"/*_1_trimmed_rRNA_removed.fastq; do
    # Infer the name of the R2 file from the name of the R1 file
    R2="${R1%_1_trimmed_rRNA_removed.fastq}_2_trimmed_rRNA_removed.fastq"

    # Extract the base name for output naming
    SAMPLE_NAME=$(basename "$R1" _1_trimmed_rRNA_removed.fastq)

    echo "Processing $SAMPLE_NAME"

    # Run STAR for alignment
    STAR --runMode alignReads \
         --runThreadN $THREADS \
         --genomeDir "$GENOME_DIR" \
         --readFilesIn "$R1" "$R2" \
         --outFileNamePrefix "${ALIGNMENT_DIR}/${SAMPLE_NAME}." \
         --outSAMtype BAM SortedByCoordinate

    echo "Alignment completed for $SAMPLE_NAME"
done

echo "All sample alignments have been completed."

–runMode alignReads: This specifies the operation mode of STAR. alignReads tells STAR to perform the read alignment against the prepared genome index.

–runThreadN Number of threads

–genomeDir “$GENOME_DIR”: This sets the path to the directory containing the genome index files that were previously generated with STAR in genomeGenerate mode.

–readFilesIn “\(R1" "\)R2”: Specifies the input files containing the RNA-seq reads.

–outFileNamePrefix “\({ALIGNMENT_DIR}/\){SAMPLE_NAME}.”: This parameter sets the prefix for all output files generated by STAR.

–outSAMtype BAM SortedByCoordinate: This tells STAR to output the alignment results in BAM format, sorted by genomic coordinates.

Checking mapping with MultiQC. mapping_quality.sh:

#!/bin/bash

# Define the directory containing STAR output files
STAR_OUTPUT_DIR="alignments"

# Define the directory where you want to save the MultiQC report
REPORT_DIR="$(pwd)/star_multiqc_report"

# Run MultiQC on the STAR outputs
echo "Running MultiQC on the STAR outputs..."
multiqc $STAR_OUTPUT_DIR -o $REPORT_DIR

Output:

We can also assess rRNA removal via RSeQC

#!/bin/bash

# Define the directory containing BAM files and where to output results
BAM_DIR="alignments"
OUTPUT_DIR="rRNA_overlap_results"
RIBO_BED="rRNA.bed"

# Create output directory if it does not exist
mkdir -p "$OUTPUT_DIR"

# Loop through each BAM file in the directory
for BAM_FILE in "$BAM_DIR"/*.bam; do
    # Extract the base name of the file for output naming
    BASE_NAME=$(basename "$BAM_FILE" .bam)
    
    # Perform bedtools intersect
    bedtools intersect -abam "$BAM_FILE" -b "$RIBO_BED" -bed -wa | wc -l > "$OUTPUT_DIR/${BASE_NAME}_rRNA_count.txt"
    
    echo "Processed $BAM_FILE"
done

echo "All files processed. rRNA read counts are in $OUTPUT_DIR."

The following table compares number of residual RNA before and after removal:

Filename    RNA Before  RNA After
SRR21190756.Aligned.sortedByCoord.out_rRNA_count.txt    954 156
SRR21190757.Aligned.sortedByCoord.out_rRNA_count.txt    681 118
SRR21190758.Aligned.sortedByCoord.out_rRNA_count.txt    930 102
SRR21190759.Aligned.sortedByCoord.out_rRNA_count.txt    661 63
SRR21190760.Aligned.sortedByCoord.out_rRNA_count.txt    548 62
SRR21190761.Aligned.sortedByCoord.out_rRNA_count.txt    172 42
SRR21190762.Aligned.sortedByCoord.out_rRNA_count.txt    1159    293
SRR21190763.Aligned.sortedByCoord.out_rRNA_count.txt    1373    208
SRR21190764.Aligned.sortedByCoord.out_rRNA_count.txt    1017    260
SRR21190765.Aligned.sortedByCoord.out_rRNA_count.txt    1251    258

Looks like residual rRNA removal was quite good!

6. FeatureCounts

featurecounts.sh:

#!/bin/bash

# Path to the directory containing BAM files
BAM_DIR="alignments"

# Path to the GTF annotation file
ANNOTATION_FILE="GSE211993_gencode_fantomcat.v1.02.genes_transcripts_exons.gtf"

# Output directory for featureCounts results
OUTPUT_DIR="featureCounts_results"
mkdir -p "$OUTPUT_DIR"

# Output file name
OUTPUT_FILE="${OUTPUT_DIR}/featureCounts_results.txt"

# Run featureCounts with the specified options
featureCounts -a "$ANNOTATION_FILE" \
              -o "$OUTPUT_FILE" \
              -F GTF \
              -s 2 \
              -p \
              --countReadPairs \
              -C \
              -B \
              --extraAttributes gene_name \
              --minOverlap 90 \
              --fracOverlap 0.9 \
              -T 16 \
              "$BAM_DIR"/*.bam

echo "featureCounts has completed."

-a “$ANNOTATION_FILE”: Specifies the gene annotation file, which contains information about gene features.

-o “$OUTPUT_FILE”: Designates the output file where the counts of reads aligned to features (genes) will be saved.

-F GTF: Sets the format of the annotation file to GTF (Gene Transfer Format).

-s 2: Indicates that the script should count only reverse-stranded reads (useful for strand-specific RNA-seq data).

-p: Instructs featureCounts to count fragments instead of individual reads, assuming paired-end RNA-seq data.

–countReadPairs: Ensures that read pairs are counted only once, even if both reads in a pair are mapped.

-C: Disallows counting of reads that have their two ends mapping to different chromosomes or mapping to the same chromosome but on opposite strands.

-B: Requires both reads in a pair to be successfully aligned before counting them (for paired-end data).

–extraAttributes gene_name: Includes additional gene attributes (in this case, gene_name) in the output file for better readability and interpretation.

–minOverlap 90: Sets the minimum number of bases that must overlap with a feature for a read to be considered for counting (90 bases in this case).

–fracOverlap 0.9: Specifies that at least 90% of a read must overlap with a feature to be counted.

-T 16: Allocates 16 CPU threads to be used for processing, speeding up the counting process.

7. Gene count summary

Important information:I will be looking only at L150P, I will get rid of A79V samples and leave only 10 L150P samples, 5 control and 5 with familial AD. First lets load all necessary libraries:

# Load necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(reshape2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(stringr)
library(DESeq2)
## Warning: package 'DESeq2' was built under R version 4.3.3
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:plotly':
## 
##     rename
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:plotly':
## 
##     slice
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Warning: package 'GenomeInfoDb' was built under R version 4.3.3
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following object is masked from 'package:dplyr':
## 
##     count
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
library(ggrepel)
library(rtracklayer)
library(pheatmap)
library(clusterProfiler)
## Warning: package 'clusterProfiler' was built under R version 4.3.3
## 
## clusterProfiler v4.10.1  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
## 
## If you use clusterProfiler in published research, please cite:
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
## 
## Attaching package: 'clusterProfiler'
## The following object is masked from 'package:IRanges':
## 
##     slice
## The following object is masked from 'package:S4Vectors':
## 
##     rename
## The following object is masked from 'package:stats':
## 
##     filter
library(AnnotationDbi)
## 
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:clusterProfiler':
## 
##     select
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
library(ensembldb)  # Loaded only once
## Warning: package 'ensembldb' was built under R version 4.3.3
## Loading required package: GenomicFeatures
## Warning: package 'GenomicFeatures' was built under R version 4.3.3
## Loading required package: AnnotationFilter
## 
## Attaching package: 'ensembldb'
## The following object is masked from 'package:clusterProfiler':
## 
##     filter
## The following object is masked from 'package:plotly':
## 
##     filter
## The following object is masked from 'package:dplyr':
## 
##     filter
## The following object is masked from 'package:stats':
## 
##     filter
library(EnsDb.Hsapiens.v86)
library(org.Hs.eg.db)
## 
library(DT)
library(AnnotationHub)
## Warning: package 'AnnotationHub' was built under R version 4.3.3
## Loading required package: BiocFileCache
## Warning: package 'BiocFileCache' was built under R version 4.3.3
## Loading required package: dbplyr
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
## 
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:rtracklayer':
## 
##     hubUrl
## The following object is masked from 'package:Biobase':
## 
##     cache

Summary plot is interactive and can be zoomed in/out

# Read the gene count summary and metadata files
gene_count_summary <- read.table("count_table.txt.summary", header = TRUE, stringsAsFactors = FALSE)
metadata <- read.csv("metadata.txt", stringsAsFactors = FALSE)

# Mapping and column name update with vectorized operations
metadata$genotype <- gsub("- gene corrected", "GC", metadata$genotype)
srr_to_genotype <- setNames(metadata$genotype, metadata$Run)

# Update column names in gene_count_summary based on SRR numbers
colnames(gene_count_summary)[-1] <- sapply(colnames(gene_count_summary)[-1], function(name) {
  srr_number <- regmatches(name, regexpr("SRR[0-9]+", name))
  if(length(srr_number) > 0 && !is.na(srr_to_genotype[srr_number])) srr_to_genotype[srr_number] else name
})

# Function to rename duplicates with suffixes
rename_duplicates <- function(names) {
  # Create a counter for each unique name
  counts <- rep(0, length(names))
  names_with_counts <- names
  for (i in seq_along(names)) {
    counts[i] <- sum(names[1:i] == names[i])
    if (counts[i] > 1) {
      names_with_counts[i] <- paste(names[i], counts[i], sep = "_")
    }
  }
  names_with_counts
}

# Applying the function to rename columns
names(gene_count_summary) <- rename_duplicates(names(gene_count_summary))


# Reorder the dataframe columns based on sorted names directly
gene_count_summary <- gene_count_summary[, c("Status", sort(names(gene_count_summary)[-1]))]

# Filter rows before melting to reduce the amount of data processed
gene_count_summary_filtered <- gene_count_summary %>% 
   dplyr::filter(Status %in% c("Assigned", "Unassigned_MultiMapping", "Unassigned_NoFeatures", "Unassigned_Overlapping_Length"))

# Melt the data
gene_count_long <- melt(gene_count_summary_filtered, id.vars = "Status", variable.name = "Sample", value.name = "Count")

# Create a unique identifier for each replicate
gene_count_long <- gene_count_long %>%
  group_by(Sample) %>%
  mutate(Replicate = paste(Sample, row_number(), sep = "_")) %>%
  ungroup()

# As I will be looking only at L150P, I will get rid of A79V samples and leave only 5 L150P samples
filtered_gene_count_long <- gene_count_long %>%
  dplyr::filter(!str_detect(Sample, "A79V")) %>%
  dplyr::slice(-c(1:4)) %>%
  dplyr::slice(-c(5:20)) %>%
  dplyr::slice(-c(21:24)) %>%
  dplyr::slice(-c(25:40))

filtered_gene_count_long <- filtered_gene_count_long %>%
  mutate(
    Sample = str_replace(Sample, "(\\d+)$", function(x) as.character(as.numeric(x) - 1)),
  )

# Plotting
p <- ggplot(filtered_gene_count_long, aes(y = Sample, x = Count, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge", orientation = "y") +
  theme_minimal() +
  labs(title = "Counts for Each Replicate and Status", y = "Replicate", x = "Count") +
  scale_fill_brewer(palette = "Set3") # Adjust palette as needed

ggsave(filename = "key_data_sets/counts_for_each_replicate.png", plot = p, width = 10, height = 8, dpi = 300)

# Convert ggplot object to an interactive plotly object
ggplotly(p)

Figure 1. The majority of counts for each replicate fall under the ‘Assigned’ category, indicating successful mapping to known genomic features, while the other categories represent various types of reads that could not be uniquely or properly assigned. This distribution suggests a consistent pattern of read assignment across the replicates, with a relatively small fraction of reads being unassigned or ambiguously mapped.

8. Now lets process the count table and plot essential graphs

# Read the gene count summary and metadata files
count_table <- read.table("count_table.txt", header = TRUE, stringsAsFactors = FALSE)

# mapping and column name update with vectorized operations
metadata$genotype <- gsub("- gene corrected", "GC", metadata$genotype)
srr_to_genotype <- setNames(metadata$genotype, metadata$Run)

# Update column names based on SRR numbers
colnames(count_table)[7:ncol(count_table)] <- sapply(colnames(count_table)[7:ncol(count_table)], function(name) {
  srr_number <- regmatches(name, regexpr("SRR[0-9]+", name))
  if(length(srr_number) > 0 && !is.na(srr_to_genotype[srr_number])) {
    srr_to_genotype[srr_number]
  } else {
    name
  }
})


# Extract the gene identifiers (or names) to use as row names
gene_ids <- count_table[, 1]

# Extract the count data, excluding the first column and non-count columns
count_data <- count_table[, 8:ncol(count_table)] 

# Convert to matrix if it's not already one
count_data_matrix <- as.matrix(count_data)

# Assign gene identifiers as row names to the matrix
rownames(count_data_matrix) = gene_ids

count_data_matrix <- count_data_matrix[, order(colnames(count_data_matrix))]

# As I will be looking only at L150P, I will get rid of A79V samples and first 5 L150P samples
filtered_count_data_matrix <- count_data_matrix[, -c(1:20)]
filtered_count_data_matrix <- filtered_count_data_matrix[, -c(1:5)]
filtered_count_data_matrix <- filtered_count_data_matrix[, -c(6:10)]
colData <- data.frame(
  condition = factor(c("L150P", "L150P", "L150P","L150P","L150P", "Control", "Control","Control","Control","Control" ))
)
rownames(colData) = colnames(filtered_count_data_matrix)

Running DESeq2 on L150P and Control samples

dds<- DESeqDataSetFromMatrix(
  countData = filtered_count_data_matrix,
  colData = colData,
  design = ~ condition
)

dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
results <- results(dds)
results <- na.omit(results)

MA Plot

plotMA(results, main="MA-plot", ylim=c(-10,10))

Figure 2. The blue points represent genes that are significantly differentially expressed, with their dispersion reflecting the strength and direction of differential expression. The majority of genes do not exhibit significant changes in expression (grey points), indicating stable expression across the conditions compared. The clustering of blue points away from the zero log fold change line, especially for genes with higher mean expression levels, suggests a set of genes with notable expression differences that may be biologically significant in the context of the L150P variant.

PCA plot:

rld <- rlogTransformation(dds)

pcaData <- plotPCA(rld, intgroup="condition", returnData=TRUE)
## using ntop=500 top features by variance
pcaData$sample <- rownames(pcaData)

ggplot(pcaData, aes(x=PC1, y=PC2, label=sample)) +
  geom_point(aes(color=condition), size=2) +
  geom_text_repel(aes(color=condition), size=3) +
  theme_minimal() +
  ggtitle("PCA of rlog-transformed counts with Sample Labels") +
  xlab(paste0("PC1 (", round(100 * attr(pcaData, "percentVar")[1]), "% variance)")) +
  ylab(paste0("PC2 (", round(100 * attr(pcaData, "percentVar")[2]), "% variance)")) +
  scale_color_manual(values=c("Condition1"="blue", "Condition2"="red"))
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## No shared levels found between `names(values)` of the manual scale and the
## data's colour values.

Figure 3. The PCA plot illustrates the separation between the control group and the L150P variant samples, with principal component 1 (PC1) accounting for 98% of the variance. The distinct clustering of the L150P samples away from the control on PC1 suggests a significant impact of the L150P mutation on the overall gene expression profile. The variance captured by PC2 is relatively small, indicating that most of the expression variation is explained by PC1. This separation supports the potential of the L150P mutation to drive distinct transcriptional changes.

Number of Differentially expressed genes:

res<- results(dds)

significant_genes <- res[which(res$padj <0.05), ]

significant_fold_change_genes<- significant_genes[which(abs(significant_genes$log2FoldChange)>1),]

Summary of the results (!!!gtf file weights around 1GB which is why I was not able to upload it on GitHub, you can download it from HERE, unzip and change file/variable name!!!):

gtf_path <- "genome_annotation.gtf"
gtf_data <- import(gtf_path, format = "gtf")

# Create a mapping table of CATG IDs to ref_gene values
gene_mapping <- gtf_data %>%
  as.data.frame() %>%
  dplyr::filter(!is.na(gene_id) & !is.na(ref_gene)) %>%
  dplyr::select(gene_id, ref_gene) %>%
  distinct()

# Convert DESeqResults to DataFrame
res_df <- as.data.frame(res)

# Add the gene IDs as a new column 
res_df$gene_id <- rownames(res_df)

# Merge with gene_mapping to get the ref_gene values for CATG IDs
mapped_res_df <- merge(res_df, gene_mapping, by.x = "gene_id", by.y = "gene_id", all.x = TRUE)

# If there is no matching ref_gene, keep the original CATG ID
mapped_res_df$final_gene_id <- ifelse(is.na(mapped_res_df$ref_gene), mapped_res_df$gene_id, mapped_res_df$ref_gene)

# Filter for significant genes and significant fold changes
significant_genes <- mapped_res_df[which(mapped_res_df$padj < 0.05), ]
significant_fold_change_genes <- significant_genes[which(abs(significant_genes$log2FoldChange) > 1), ]

# Convert to a data frame
significant_fold_change_genes_df <- as.data.frame(significant_fold_change_genes)

# Reverse the data frame to prioritize later rows when removing duplicates
significant_fold_change_genes_df <- significant_fold_change_genes_df[rev(seq_len(nrow(significant_fold_change_genes_df))),]

# Remove duplicate gene_ids while keeping the last occurrence (now the first because we reversed the dataset)
significant_fold_change_genes_df <- significant_fold_change_genes_df[!duplicated(significant_fold_change_genes_df$gene_id),]

# Re-reverse the data frame to restore original order
significant_fold_change_genes_df <- significant_fold_change_genes_df[rev(seq_len(nrow(significant_fold_change_genes_df))),]

# Update gene_id conditionally
significant_fold_change_genes_df <- significant_fold_change_genes_df %>%
  mutate(gene_id = if_else(grepl("^CATG", gene_id), final_gene_id, gene_id))

# Filter out rows where gene_id contains "-", and then select only the required columns
significant_fold_change_genes_df <- significant_fold_change_genes_df %>%
  dplyr::filter(!grepl("-", gene_id)) %>%
  dplyr::select(-ref_gene, -final_gene_id)

# Sort the DataFrame by absolute log2FoldChange in descending order
top_differentially_expressed_genes <- significant_fold_change_genes_df %>%
  arrange(desc(abs(log2FoldChange)))

cat("\nNumber of significant genes with |log2FoldChange| > 1: ", nrow(significant_fold_change_genes_df), "\n")
## 
## Number of significant genes with |log2FoldChange| > 1:  4384
write.csv(significant_fold_change_genes_df, 
          file = "key_data_sets/significant_fold_change_genes.csv", 
          row.names = FALSE)

Volcano plot:

plot(significant_genes$log2FoldChange, -log10(significant_genes$pvalue),
     pch=20, main="Volcano Plot", xlab="Log2 Fold Change", ylab="-Log10 p-value")
points(significant_fold_change_genes$log2FoldChange, -log10(significant_fold_change_genes$pvalue),
       col="red", pch=20)

Figure 4. The density of points in the significant regions suggests a substantial number of genes are differentially expressed.

Heatmap:

significant_genes <- res[which(res$padj <0.05), ]
significant_fold_change_genes<- significant_genes[which(abs(significant_genes$log2FoldChange)>1),]

selected_genes <- rownames(significant_fold_change_genes)
rlog_counts <- assay(rlogTransformation(dds))
pheatmap(rlog_counts[selected_genes, ], cluster_rows=TRUE, show_rownames=FALSE,
         cluster_cols=TRUE, annotation_col=as.data.frame(colData))

Figure 5. This figure illustrates the expression patterns of a set of genes across various samples under two different conditions: Control and L150P. The color intensity within the heatmap represents the expression level of genes, with blue indicating lower expression, yellow representing medium expression, and red signifying higher expression levels. The hierarchical clustering on the left groups genes with similar expression profiles, while the clustering on top organizes samples into clusters based on their overall expression similarities. The annotated bar above the heatmap categorizes the samples by condition, providing a clear visual distinction between the Control (light blue) and L150P (pink) groups. The dendrograms’ branch lengths depict the degree of similarity, with shorter branches indicating closer relationships.

GO Enrichment analysis

Preparation for actual analysis:

gene_list <- significant_fold_change_genes_df$gene_id


gene_list <- sapply(gene_list, function(x) {
  if(grepl("^ENSG", x)) {  # Check if the gene ID starts with "ENSG"
    sub("\\..*$", "", x)  # Remove dot and everything after
  } else {
    x
  }
}, USE.NAMES = FALSE)

Identification of protein-coding genes:

hub = AnnotationHub()
edb = hub[["AH75011"]]
## loading from cache
keys = keys(edb, "GENENAME")
columns =  c("GENEID", "GENEBIOTYPE")
tbl =
    ensembldb::select(edb, keys, columns, keytype = "GENENAME") %>%
    as_tibble()

Illustrating categories of significant logFC genes:

gene_list_df <- data.frame(GENEID = gene_list)
matched_genes <- merge(gene_list_df, tbl, by = "GENEID")
biotype_counts <- table(matched_genes$GENEBIOTYPE)

write.table(biotype_counts, file = "key_data_sets/biotype_counts.txt", quote = FALSE, sep = "\t", col.names = NA)


print(biotype_counts)
## 
##                             lncRNA                              miRNA 
##                               1109                                  1 
##               processed_pseudogene                     protein_coding 
##                                128                               2942 
##                         pseudogene                              snRNA 
##                                  1                                  2 
##                                TEC                          TR_C_gene 
##                                 51                                  1 
##                          TR_V_gene   transcribed_processed_pseudogene 
##                                  1                                 28 
##     transcribed_unitary_pseudogene transcribed_unprocessed_pseudogene 
##                                  7                                 76 
##             unprocessed_pseudogene 
##                                 32
biotype_data <- data.frame(
  GeneBiotype = names(biotype_counts),
  Count = as.numeric(biotype_counts)
)

ggplot(biotype_data, aes(x = 1, y = Count, fill = GeneBiotype)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  labs(fill = "Gene Biotype", title = "Distribution of Gene Biotypes")

Figure 6. This pie chart presents the distribution of gene biotypes in a given dataset. The majority of the genes fall into two primary categories, with the largest segment in green representing ‘protein_coding’ genes, which are responsible for creating proteins that carry out various functions in the cell. The second largest category, in red, signifies ‘lncRNA’ genes, which are long non-coding RNAs involved in the regulation of gene expression and chromatin dynamics. The remaining slivers of the pie chart indicate other biotypes, including ‘miRNA’ (microRNAs), ‘processed_pseudogene’, ‘pseudogene’, ‘snRNA’ (small nuclear RNAs), ‘TEC’ (to be experimentally confirmed), ‘TR_C_gene’ (T-cell receptor constant genes), ‘TR_V_gene’ (T-cell receptor variable genes), ‘transcribed_processed_pseudogene’, and ‘transcribed_unitary_pseudogene’, each contributing to a smaller proportion of the total gene distribution.

Extracting only protein-coding genes and illustrating new top differentially expressed genes table with gene-names:

protein_coding_genes <- subset(matched_genes, GENEBIOTYPE == 'protein_coding')

gene_list<- protein_coding_genes$GENEID

# Assuming 'top_differentially_expressed_genes' is sorted and has the top entries
top_differentially_expressed_genes <- significant_fold_change_genes_df %>%
  arrange(desc(abs(log2FoldChange))) %>%
  slice_head(n = 20)  %>%
  mutate(gene_id = sub("\\..*", "", gene_id))

# Perform a left join with protein_coding_genes to add the GENENAME
top_differentially_expressed_genes <- top_differentially_expressed_genes %>%
  left_join(protein_coding_genes, by = c("gene_id" = "GENEID")) %>%
  mutate(GENENAME = ifelse(is.na(GENENAME), "NA", GENENAME))

# Select and rearrange columns
columns_to_remove <- c("GENEBIOTYPE")
if ("GENEBIOTYPE" %in% names(top_differentially_expressed_genes)) {
  top_differentially_expressed_genes <- dplyr::select(top_differentially_expressed_genes, -all_of(columns_to_remove))
}

# Place 'GENENAME' before 'baseMean'
top_differentially_expressed_genes <- top_differentially_expressed_genes %>%
  relocate(GENENAME, .after = gene_id)

top_differentially_expressed_genes <- top_differentially_expressed_genes %>%
  dplyr::filter(GENENAME != "NA")

write.csv(top_differentially_expressed_genes, 
          file = "key_data_sets/top_differentially_expressed_genes.csv", 
          row.names = FALSE)

# Print the updated dataframe
print(top_differentially_expressed_genes)
##            gene_id  GENENAME   baseMean log2FoldChange     lfcSE       stat
## 1  ENSG00000135903      PAX3  119.81763       9.694614 0.9175215  10.566090
## 2  ENSG00000118271       TTR 3510.06238       9.576438 0.2697968  35.495006
## 3  ENSG00000008196    TFAP2B  743.84738       9.335211 0.4387665  21.276037
## 4  ENSG00000164600   NEUROD6  535.67684       8.620576 0.4120384  20.921780
## 5  ENSG00000164707   SLC13A4   56.69134       8.615710 0.9325180   9.239190
## 6  ENSG00000197901   SLC22A6   32.39568       8.389223 0.9574148   8.762370
## 7  ENSG00000143768    LEFTY2 1002.51074      -8.166329 0.2571610 -31.755713
## 8  ENSG00000135406      PRPH  246.71309      -7.929730 0.4443326 -17.846384
## 9  ENSG00000111432     FZD10   32.81085       7.829345 0.9426798   8.305412
## 10 ENSG00000143032    BARHL2   62.97504       7.641543 0.8588563   8.897348
## 11 ENSG00000235718      MFRP   28.65649       7.630850 0.9397187   8.120356
## 12 ENSG00000137809    ITGA11  827.31845      -7.605255 0.2412512 -31.524224
## 13 ENSG00000205221       VIT   27.76443       7.584131 0.9285841   8.167414
## 14 ENSG00000166634 SERPINB12   27.63697       7.570547 0.9271651   8.165264
## 15 ENSG00000125084      WNT1   56.00444       7.476892 0.8570408   8.724078
## 16 ENSG00000123307   NEUROD4   86.76264       7.368823 0.6626509  11.120219
##           pvalue          padj
## 1   4.279697e-26  7.865586e-25
## 2  5.869574e-276 2.724798e-273
## 3  1.892810e-100  1.481509e-98
## 4   3.392162e-97  2.523038e-95
## 5   2.483713e-20  3.620681e-19
## 6   1.911881e-18  2.524640e-17
## 7  2.648005e-221 7.584843e-219
## 8   3.083695e-71  1.560686e-69
## 9   9.947330e-17  1.192484e-15
## 10  5.719671e-19  7.746587e-18
## 11  4.648177e-16  5.366731e-15
## 12 4.045702e-218 1.111536e-215
## 13  3.150706e-16  3.667176e-15
## 14  3.207335e-16  3.730346e-15
## 15  2.683569e-18  3.510938e-17
## 16  1.000259e-28  1.970151e-27
print.data.frame(top_differentially_expressed_genes, max = NULL)
##            gene_id  GENENAME   baseMean log2FoldChange     lfcSE       stat
## 1  ENSG00000135903      PAX3  119.81763       9.694614 0.9175215  10.566090
## 2  ENSG00000118271       TTR 3510.06238       9.576438 0.2697968  35.495006
## 3  ENSG00000008196    TFAP2B  743.84738       9.335211 0.4387665  21.276037
## 4  ENSG00000164600   NEUROD6  535.67684       8.620576 0.4120384  20.921780
## 5  ENSG00000164707   SLC13A4   56.69134       8.615710 0.9325180   9.239190
## 6  ENSG00000197901   SLC22A6   32.39568       8.389223 0.9574148   8.762370
## 7  ENSG00000143768    LEFTY2 1002.51074      -8.166329 0.2571610 -31.755713
## 8  ENSG00000135406      PRPH  246.71309      -7.929730 0.4443326 -17.846384
## 9  ENSG00000111432     FZD10   32.81085       7.829345 0.9426798   8.305412
## 10 ENSG00000143032    BARHL2   62.97504       7.641543 0.8588563   8.897348
## 11 ENSG00000235718      MFRP   28.65649       7.630850 0.9397187   8.120356
## 12 ENSG00000137809    ITGA11  827.31845      -7.605255 0.2412512 -31.524224
## 13 ENSG00000205221       VIT   27.76443       7.584131 0.9285841   8.167414
## 14 ENSG00000166634 SERPINB12   27.63697       7.570547 0.9271651   8.165264
## 15 ENSG00000125084      WNT1   56.00444       7.476892 0.8570408   8.724078
## 16 ENSG00000123307   NEUROD4   86.76264       7.368823 0.6626509  11.120219
##           pvalue          padj
## 1   4.279697e-26  7.865586e-25
## 2  5.869574e-276 2.724798e-273
## 3  1.892810e-100  1.481509e-98
## 4   3.392162e-97  2.523038e-95
## 5   2.483713e-20  3.620681e-19
## 6   1.911881e-18  2.524640e-17
## 7  2.648005e-221 7.584843e-219
## 8   3.083695e-71  1.560686e-69
## 9   9.947330e-17  1.192484e-15
## 10  5.719671e-19  7.746587e-18
## 11  4.648177e-16  5.366731e-15
## 12 4.045702e-218 1.111536e-215
## 13  3.150706e-16  3.667176e-15
## 14  3.207335e-16  3.730346e-15
## 15  2.683569e-18  3.510938e-17
## 16  1.000259e-28  1.970151e-27

Figure 7. This figure illustates top differentially expressed protein-coding genes alongside with gene names and other valuable statistics.

Biological Processes:

ego <- enrichGO(gene         = gene_list,
                OrgDb         = org.Hs.eg.db,
                keyType      = "ENSEMBL",
                ont          = "BP",
                pAdjustMethod = "BH",
                qvalueCutoff  = 0.05,
                readable      = TRUE)

ego_df <- as.data.frame(ego)

datatable(ego_df, options = list(pageLength = 5), 
          caption = 'Top GO Terms from Enrichment Analysis (BP)')
write.csv(ego_df, 
          file = "key_data_sets/enriched_BP_GO_terms.csv", 
          row.names = FALSE)

dotplot(ego)

Figure 8. The given chart illustrates that processes related to the extracellular matrix and organ development, such as “extracellular structure organization”, “external encapsulating structure organization” and “extracellular matrix organization” are notably enriched.

Cellular component:

ego <- enrichGO(gene         = gene_list,
                OrgDb         = org.Hs.eg.db,
                keyType      = "ENSEMBL",
                ont          = "CC",
                pAdjustMethod = "BH",
                qvalueCutoff  = 0.05,
                readable      = TRUE)

ego_df <- as.data.frame(ego)

datatable(ego_df, options = list(pageLength = 5), 
          caption = 'Top GO Terms from Enrichment Analysis (CC)')
write.csv(ego_df, 
          file = "key_data_sets/enriched_CC_GO_terms.csv", 
          row.names = FALSE)

dotplot(ego)

Figure 9. The plot graphically summarizes gene enrichment in cellular components, with a notable focus on structural elements such as the collagen-containing extracellular matrix, basement membrane andendoplasmic reticulum lumen .

Molecular function:

ego <- enrichGO(gene         = gene_list,
                OrgDb         = org.Hs.eg.db,
                keyType      = "ENSEMBL",
                ont          = "MF",
                pAdjustMethod = "BH",
                qvalueCutoff  = 0.05,
                readable      = TRUE)

ego_df <- as.data.frame(ego)

datatable(ego_df, options = list(pageLength = 5), 
          caption = 'Top GO Terms from Enrichment Analysis (MF)')
write.csv(ego_df, 
          file = "key_data_sets/enriched_MF_GO_terms.csv", 
          row.names = FALSE)

dotplot(ego)

Figure 10. The plot visualizes molecular function enrichment, showing a range of activities from ion channel activity to various binding interactions. The chart suggests that the analyzed genes have significant roles in ion channel functions, such as “monoatomic ion channel activity” and “gated channel activity,” as well as binding to components of the extracellular matrix like glycosaminoglycan and heparin.

Discussion

I believe some of my script or code chunks could be rewritten in a simpler and faster manner. Sometimes, I overthink, and a small, easy problem turns into 50 lines of code. However, I think this is acceptable since quality is more important than speed.

The biggest challenge in this project was working with the custom annotation file provided by the original paper’s authors, which combines GENCODE and FANTOM-CAT annotations. Overall, it follows the standard GFF2 protocol, but not all gene IDs are formatted consistently. Most of them are Ensembl identifiers, but approximately 15-20% are CAGE identifiers (starting with CATG). CAGE identifiers are not widely used, resulting in a lack of tools or libraries in R for mapping CAGE to Ensembl identifiers. I had to import the annotation file and write code that would extract reference gene IDs and map CAGE ones to Ensembl in order to analyze my dataset properly.

Another issue was identifying protein-coding genes, separating them from non-coding ones, and adding gene names to my dataframe (see chunks 33-35). This step was crucial for the correct interpretation and analysis of the data, as it allowed for more accurate downstream analyses, such as differential gene expression and enrichment analysis. Correctly identifying and annotating these genes ensured the biological relevance of the findings and enhanced the credibility of the study.

Key Data Sets

File Location Description
counts_for_each_replicate.png key_data_sets Image showing the count for each replicate, categorized by status.
biotype_counts.txt key_data_sets Text file listing the number of each gene biotype in the significant genes dataset (fold change > abs 1).
top_differentially_expressed_genes.csv key_data_sets CSV file documenting the most prevalent protein-coding gene IDs along with their respective gene names.
significant_fold_change_genes.csv key_data_sets CSV file documenting ALL significant genes with a fold change greater than abs 1).
enriched_BP_GO_terms.csv key_data_sets CSV file containing all Biological Process Gene Ontology (GO) terms identified in the protein-coding dataset.
enriched_CC_GO_terms.csv key_data_sets CSV file containing all Cellular Component Gene Ontology (GO) terms identified in the protein-coding dataset.
enriched_MF_GO_terms.csv key_data_sets CSV file containing all Molecular Function Gene Ontology (GO) terms identified in the protein-coding dataset.

Citations

[1] Corsi, G. I., Gadekar, V. P., Haukedal, H., Doncheva, N. T., Anthon, C., Ambardar, S., Palakodeti, D., Hyttel, P., Freude, K., Seemann, S. E., & Gorodkin, J. (2023, March). The transcriptomic landscape of neurons carrying PSEN1 mutations reveals changes in extracellular matrix components and non-coding gene expression. Neurobiology of Disease, 178, 105980. https://doi.org/10.1016/j.nbd.2022.105980

[2] Bagaria, J., Bagyinszky, E., & An, S. S. A. (2022, September 19). Genetics, Functions, and Clinical Impact of Presenilin-1 (PSEN1) Gene. International Journal of Molecular Sciences, 23(18), 10970. https://doi.org/10.3390/ijms231810970

[3] Poon, A., Schmid, B., Pires, C., Nielsen, T. T., Hjermind, L. E., Nielsen, J. E., Holst, B., Hyttel, P., & Freude, K. K. (2016, November). Generation of a gene-corrected isogenic control hiPSC line derived from a familial Alzheimer’s disease patient carrying a L150P mutation in presenilin 1. Stem Cell Research, 17(3), 466–469. https://doi.org/10.1016/j.scr.2016.09.018